Welcome everybody to deep learning. So today we want to conclude talking about different
regularization methods and we want to talk in particular about one more technique that
is called multitask learning. With multitask learning, we want to extend the previous concept. So
previously we only had one network for one task and then we had transfer learning to
reuse the network. But the question is can we do more? Can we do it in a better way?
And there are some real world examples. For example, if you learn to play the piano and
the violin, then in both tasks you require good hearing, sense of rhythm, music notation
and so on. So there are some things that can be shared. Or also soccer and basketball training.
Both require stamina, speed, body awareness, body-eye coordination. So if you learn the
one, then you typically also have benefits for the other. So this would be even better
than reusing: you learn simultaneously, and this provides a better understanding of
the shared underlying concepts. So the idea now is that we train a network simultaneously on multiple related tasks. We adapt the loss function to assess the performance on multiple tasks, and this then gives us multitask learning, which introduces a so-called inductive bias.
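To make "adapting the loss function" a bit more tangible, one common way to write the combined objective is a weighted sum of the per-task losses. The task weights $\lambda_t$ are an assumption on my side (the simplest choice is $\lambda_t = 1$ for every task) and are not specified in the lecture:

$$
L(\theta_{\text{shared}}, \theta_1, \dots, \theta_T) \;=\; \sum_{t=1}^{T} \lambda_t \, L_t(\theta_{\text{shared}}, \theta_t),
$$

where $L_t$ is the loss of task $t$, $\theta_{\text{shared}}$ are the parameters of the shared layers, and $\theta_t$ are the task-specific parameters.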
We prefer a model that can explain more than a single task. Also this reduces the risk
of overfitting on one particular task and our model generalizes better. So let's look
at the setup. So we have some shared input layers. So these are like the feature extraction
layers and the representation layers. And then we split at some point where we go into
task specific layers and then evaluate it on task A, task B, task C. And they may be
very different but somehow related because otherwise it wouldn't make sense to share
the previous layers. So several hidden layers are shared between all of the tasks; this is called hard parameter sharing. And as already shown by Baxter in 1997 [2], multitask learning of n tasks reduces the chance of overfitting by an order of n.
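To make the hard-sharing setup concrete, here is a minimal PyTorch-style sketch of my own (not code from the lecture): a shared trunk feeds several task-specific heads, and the per-task losses are simply summed so that the shared layers receive gradients from every task. The layer sizes, the optimizer, and the choice of two classification tasks plus one regression task are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class HardSharingNet(nn.Module):
    """Shared feature/representation layers with one head per task."""
    def __init__(self, in_dim=128, hidden=64):
        super().__init__()
        # Shared layers (hard parameter sharing)
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Task-specific layers
        self.head_a = nn.Linear(hidden, 10)  # task A: 10-class classification
        self.head_b = nn.Linear(hidden, 5)   # task B: 5-class classification
        self.head_c = nn.Linear(hidden, 1)   # task C: regression

    def forward(self, x):
        z = self.shared(x)
        return self.head_a(z), self.head_b(z), self.head_c(z)

# One training step: the total loss is the sum of the task losses, so the
# shared layers are updated with gradients from all three tasks at once.
model = HardSharingNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 128)
y_a = torch.randint(0, 10, (32,))
y_b = torch.randint(0, 5, (32,))
y_c = torch.randn(32, 1)

out_a, out_b, out_c = model(x)
loss = (nn.functional.cross_entropy(out_a, y_a)
        + nn.functional.cross_entropy(out_b, y_b)
        + nn.functional.mse_loss(out_c, y_c))
opt.zero_grad()
loss.backward()
opt.step()
```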
Instead of hard sharing, you can also do soft parameter sharing. Soft parameter sharing introduces an additional loss term: you constrain the activations (or parameters) of particular layers to be similar. So each model has its own parameters, but we link them together so that the constrained layers perform similar, yet not identical, extraction steps. You can do that, for example, with an L2 norm or another norm that makes them similar.
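As a small sketch of soft parameter sharing (again my own illustration, not from the lecture): every task keeps its own feature layers and head, and an extra L2 penalty pulls the corresponding feature-layer parameters towards each other. Coupling the parameters rather than the activations, and the coupling strength LAMBDA_SHARE, are assumptions made here for simplicity.

```python
import torch
import torch.nn as nn

def make_task_net(n_out):
    # Each task gets its own feature layers and its own head.
    return nn.ModuleDict({
        "features": nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                                  nn.Linear(64, 64), nn.ReLU()),
        "head": nn.Linear(64, n_out),
    })

def soft_sharing_penalty(feat_a, feat_b):
    """Sum of squared differences between corresponding feature parameters."""
    penalty = torch.zeros(())
    for p_a, p_b in zip(feat_a.parameters(), feat_b.parameters()):
        penalty = penalty + (p_a - p_b).pow(2).sum()
    return penalty

net_a, net_b = make_task_net(10), make_task_net(5)
LAMBDA_SHARE = 1e-2  # assumed coupling strength (hyperparameter)

x_a, y_a = torch.randn(32, 128), torch.randint(0, 10, (32,))
x_b, y_b = torch.randn(32, 128), torch.randint(0, 5, (32,))

logits_a = net_a["head"](net_a["features"](x_a))
logits_b = net_b["head"](net_b["features"](x_b))

# Task losses plus the soft-sharing constraint on the feature layers.
loss = (nn.functional.cross_entropy(logits_a, y_a)
        + nn.functional.cross_entropy(logits_b, y_b)
        + LAMBDA_SHARE * soft_sharing_penalty(net_a["features"], net_b["features"]))

opt = torch.optim.Adam(list(net_a.parameters()) + list(net_b.parameters()), lr=1e-3)
opt.zero_grad()
loss.backward()
opt.step()
```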
Now we still have to talk about the auxiliary tasks. All of these tasks should have their own purpose, but you may also include auxiliary tasks simply because
you want to create a more stable network. So one example here is facial landmark detection
by Zhang et al. [22]. They essentially want to detect facial landmarks, but this is impeded by occlusion and pose variation. So they simultaneously learn the landmarks and subtly related tasks
like the face pose, smiling, not smiling, glasses, no glasses, occlusion and gender.
So they had this information available and then you can set this up in a multitask learning
framework, as you see in the network architecture. In the results, they also report the auxiliary tasks, but they are actually interested in the facial landmarks. They compare a plain CNN, a cascaded CNN, and their multitask network with the auxiliary tasks, and they can show that the landmark detection itself is improved by introducing these auxiliary tasks. So certain features may be difficult to learn for one task but
it may be easier for a related one. So the auxiliary tasks can help to steer the training
in a specific direction and we somehow include prior knowledge by choosing appropriate auxiliary
tasks. And of course, the tasks can have different convergence rates, so you can also introduce task-based early stopping, meaning that you stop training on a task once it no longer improves on a validation set.
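Task-based early stopping is only mentioned by name here, so the following is just one possible reading of it, sketched in plain Python: each task contributes to the total loss with its own weight, and a task's weight is set to zero once its validation loss has not improved for a given number of epochs. The class name, the patience value, and the 0/1 weighting are assumptions, not details from the lecture or from Zhang et al.

```python
class TaskEarlyStopper:
    """Disable a task's loss once its validation loss stops improving."""
    def __init__(self, task_names, patience=3):
        self.patience = patience
        self.best = {t: float("inf") for t in task_names}
        self.bad_epochs = {t: 0 for t in task_names}
        self.weights = {t: 1.0 for t in task_names}

    def update(self, val_losses):
        """val_losses: dict mapping task name -> validation loss this epoch."""
        for t, loss in val_losses.items():
            if loss < self.best[t]:
                self.best[t] = loss
                self.bad_epochs[t] = 0
            else:
                self.bad_epochs[t] += 1
                if self.bad_epochs[t] >= self.patience:
                    self.weights[t] = 0.0  # stop training on this task
        return self.weights

# Usage inside a training loop (per-task training losses as a dict):
# stopper = TaskEarlyStopper(["landmarks", "pose", "smile", "glasses", "gender"])
# weights = stopper.update(val_losses_of_this_epoch)
# total_loss = sum(weights[t] * train_losses[t] for t in train_losses)
```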
An open research question is which tasks are appropriate auxiliary tasks. This is something we cannot recommend in general; it is typically determined by experimental validation. So next time on deep learning, we start with a new block where we look into some practical recommendations to actually make things work. With everything you have seen so far, you are already pretty far, but there are a couple of hints that will make your life easier, so I definitely recommend watching the next couple of lectures. We will look into how to evaluate performance and how to deal with the most common problems that essentially everybody has to face in the beginning, and we will also look at concrete case studies that put all the pieces together.
Deep Learning - Regularization Part 5
This video discusses multi-task learning.
Further Reading:
A gentle Introduction to Deep Learning
Links:
Link - for details on Maximum A Posteriori estimation and the bias-variance decomposition
Link - for a comprehensive text about practical recommendations for regularization
Link - the paper about calibrating the variances
References:
[1] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. In: Proceedings of The 32nd International Conference on Machine Learning. 2015, pp. 448–456.
[2] Jonathan Baxter. “A Bayesian/Information Theoretic Model of Learning to Learn via Multiple Task Sampling”. In: Machine Learning 28.1 (July 1997), pp. 7–39.
[3] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.
[4] Richard Caruana. “Multitask Learning: A Knowledge-Based Source of Inductive Bias”. In: Proceedings of the Tenth International Conference on Machine Learning. Morgan Kaufmann, 1993, pp. 41–48.
[5] Andre Esteva, Brett Kuprel, Roberto A Novoa, et al. “Dermatologist-level classification of skin cancer with deep neural networks”. In: Nature 542.7639 (2017), pp. 115–118.
[6] C. Ding, C. Xu, and D. Tao. “Multi-Task Pose-Invariant Face Recognition”. In: IEEE Transactions on Image Processing 24.3 (Mar. 2015), pp. 980–993.
[7] Li Wan, Matthew Zeiler, Sixin Zhang, et al. “Regularization of Neural Networks using DropConnect”. In: Proceedings of the 30th International Conference on Machine Learning (ICML-2013), pp. 1058–1066.
[8] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, et al. “Dropout: a simple way to prevent neural networks from overfitting.” In: Journal of Machine Learning Research 15.1 (2014), pp. 1929–1958.
[9] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, inc., 2000.
[10] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.
[11] Yuxin Wu and Kaiming He. “Group normalization”. In: arXiv preprint arXiv:1803.08494 (2018).
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al. “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification”. In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 1026–1034.
[13] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. “Instance Normalization: The Missing Ingredient for Fast Stylization”. In: CoRR abs/1607.08022 (2016).
[14] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, et al. “Self-Normalizing Neural Networks”. In: Advances in Neural Information Processing Systems (NIPS). 2017. arXiv: 1706.02515.
[15] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. “Layer normalization”. In: arXiv preprint arXiv:1607.06450 (2016).
[16] Nima Tajbakhsh, Jae Y Shin, Suryakanth R Gurudu, et al. “Convolutional neural networks for medical image analysis: Full training or fine tuning?” In: IEEE transactions on medical imaging 35.5 (2016), pp. 1299–1312.
[17] Yoshua Bengio. “Practical recommendations for gradient-based training of deep architectures”. In: Neural networks: Tricks of the trade. Springer, 2012, pp. 437–478.
[18] Chiyuan Zhang, Samy Bengio, Moritz Hardt, et al. “Understanding deep learning requires rethinking generalization”. In: arXiv preprint arXiv:1611.03530 (2016).
[19] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, et al. “How Does Batch Normalization Help Optimization?” In: arXiv e-prints, arXiv:1805.11604 (May 2018), arXiv:1805.11604. arXiv: 1805.11604 [stat.ML].
[20] Tim Salimans and Diederik P Kingma. “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks”. In: Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016, pp. 901–909.
[21] Xavier Glorot and Yoshua Bengio. “Understanding the difficulty of training deep feedforward neural networks”. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. 2010, pp. 249–256.
[22] Zhanpeng Zhang, Ping Luo, Chen Change Loy, et al. “Facial Landmark Detection by Deep Multi-task Learning”. In: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, Cham: Springer International Publishing, 2014, pp. 94–108.